1999 > 2024: What happened to parsing?

In the early history of NLP these structures [parses]
were an intermediate step toward deeper language
processing.


In modern NLP, we don’t generally make explicit use
of parse or other structures inside the neural language
models [...], or directly in applications like those we
discussed in Part II.


- Jurafsky & Martin 
If your end goals are
chat, MT, etc., then
do not build a parser.

(Of course, if your end goal is to
have a parser, then build a parser.
But likely you will do this with an
LLM anyway.)
What are the “parsers” of computer vision?

• Before we answer this question, take a step back
• We need to understand what our real tasks are
• Let’s study the rise of LLMs
• What lessons can we learn?


Why are LLMs so successful?

• Real tasks are text generation tasks
• High societal and economic value
• Universality: can do POS tagging, parsing,
  etc. as text generation (if you really want to)
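To make the universality point concrete, here is a minimal sketch of POS tagging cast as text generation: the input and the output are both plain strings. The prompt wording, the `word/TAG` output format, and both function names are assumptions made for illustration, not something from the talk.

```python
def to_text_generation_example(tokens, tags):
    """Serialize a tagged sentence into a (prompt, target) pair of plain strings."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    # Hypothetical prompt wording; any instruction phrasing would do.
    prompt = "Tag each word with its part of speech: " + " ".join(tokens)
    # Hypothetical output format: one word/TAG item per token.
    target = " ".join(f"{tok}/{tag}" for tok, tag in zip(tokens, tags))
    return prompt, target


def parse_generated_tags(generated):
    """Recover (token, tag) pairs from generated text in the word/TAG format."""
    pairs = []
    for item in generated.split():
        # Split on the last "/" so tokens containing "/" still parse.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


prompt, target = to_text_generation_example(
    ["The", "brown", "fox", "jumps"], ["DET", "ADJ", "NOUN", "VERB"]
)
print(prompt)
print(target)
```

The tagger here is nothing but a string-to-string mapping, so the same text generator that does chat or MT could in principle be asked for it.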


[Figure: “NLP circa 2024” vs. “NLP circa 1999”. The 1999 side shows a textbook Summary of Contents: 1 Introduction; Part I Words (2 Regular Expressions and Automata, 3 Morphology and Finite-State Transducers, 4 Computational Phonology ..., 5 Probabilistic Models of Pronunciation and Spelling, 6 ..., 7 HMMs and Speech ...); Part II Syntax (8 Word Classes and Part-of-Speech Tagging, 9 Context-Free Grammars for English); ...; Bibliography; Index]


• Text generation is all you need
But wait, there’s more! 


What if I told you: 


High-quality data is abundant 
The test task = the training task 
Training is basically supervised 
The LLM miracle

• Test task = training task (... maybe also data?)
• Massive high-quality data
• Supervised training
• All for what people truly care about: text generation

• This basically never happens, and it should blow our minds
• Most CV/ML: sad proxy loss/tasks, small data, low annotation quality, ...


What are the real tasks of computer vision?
• Caveat: focusing on recognition-y things, not 3D-y things

• Q1: Who benefits from CV?
• A1: Those who cannot see
  • People with low or no vision
    • Real task = open-ended visual QA about the real world
  • In general, AI agents operating in visual environments
    • Real tasks =
      • Robots: “cook me dinner”, “fold my laundry”, “wash my dishes”, ...
      • Internet agents: “shop for my climbing shoes at a good sale price”, ...


What are the real tasks of computer vision?

• Real tasks =
  • Open-ended visual QA about the real world
  • Robot: “cook me dinner”, ...
  • Internet agent: “shop for me”, ...

• Traditional CV tasks (classification, detection, segmentation, ...)
  are not required a priori

• Helpful intermediates or The Wrong Way (i.e., parsers)?


What are the real tasks of computer vision?

• Q2: What is the most scientifically important direction?
• A2: Algorithms that learn with human-like data constraints

• Constraints
  • Ego-centric video
  • ~24M frames in 18 months (12 hours / day, 1 FPS)
  • Limited embodied control
  • Observations have very different statistics cf. web data
• Long term, this is probably way more important
• It doesn’t require a data miracle
• But it’s hard, we have no gradient, few people
  work on it, and maybe not needed (birds vs. airplanes)


Jayaraman and Smith 
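The ~24M figure above can be checked with a few lines of arithmetic. This is a sketch only; the one added assumption is an average month length of 365.25 / 12 days, and the function name is made up for illustration.

```python
# Assumption: average month length, not stated on the slide.
DAYS_PER_MONTH = 365.25 / 12


def egocentric_frame_budget(months=18, hours_per_day=12, fps=1):
    """Frames seen in `months` of ego-centric video at the given duty cycle."""
    seconds_observed = months * DAYS_PER_MONTH * hours_per_day * 3600
    return seconds_observed * fps


frames = egocentric_frame_budget()
print(f"{frames / 1e6:.1f}M frames")  # prints "23.7M frames"
```

So the slide's ~24M is the rounded result of 18 months x 12 hours / day x 3600 seconds / hour x 1 FPS.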


What are the “parsers” of computer vision?

• Identify your real tasks (I gave 2)
  • General VQA and embodied vision
  • Learning with human-like data constraints

• Given your real tasks, fake tasks (“parsers”) are
  those subproblems that are not helpful for solving
  your real tasks
• This definition is relative to your choice of real tasks


Let’s take general VQA and embodied vision

• The system needs a vast array of skills
  • Scene text reading (OCR)
  • Object recognition (classification)
  • Object delineation (detection, segmentation)
  • Chart / infographic parsing
  • Document parsing
  • Instrument reading
  • Place recognition
  • Action recognition
  • Face recognition
  • World knowledge


Your Real Tasks 


[Slide build: the same list of skills, overlaid with “NO”]
What’s the “LLM miracle” story here?

• Hypothesis: scaling vision backbones plus LLMs (simple LLaVA-style
  model) will effectively “solve” the general VQA / embodied
  vision tasks to the extent that LLMs “solve” any problem

• But, the big open question is what data to use and how to get it

• It’s not as clean & easy as for LLMs, but I think it’s possible
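For readers unfamiliar with the “simple LLaVA-style model” in the hypothesis, the core wiring can be sketched as follows: features from a vision backbone are linearly projected into the LLM's embedding space and placed in the same sequence as the text token embeddings. Everything below is a toy stand-in under stated assumptions: random numbers replace the real backbone and LLM, and the sizes and function names are hypothetical.

```python
import random


def linear_project(features, weight):
    """Map each visual feature vector into the LLM embedding space (features @ weight)."""
    return [
        [sum(f * w for f, w in zip(feat, col)) for col in zip(*weight)]
        for feat in features
    ]


def build_llm_input(patch_features, text_embeddings, projection):
    """Prepend projected image patches to the text token embeddings."""
    visual_tokens = linear_project(patch_features, projection)
    return visual_tokens + text_embeddings


random.seed(0)
VISION_DIM, LLM_DIM = 4, 6           # hypothetical toy sizes
NUM_PATCHES, NUM_TEXT_TOKENS = 3, 5  # hypothetical toy sizes

# Stand-ins for vision-backbone outputs, text embeddings, and the learned projection.
patch_features = [[random.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[random.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]
projection = [[random.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(VISION_DIM)]

sequence = build_llm_input(patch_features, text_embeddings, projection)
print(len(sequence), "tokens of width", len(sequence[0]))
```

After the projection, image patches are just more tokens to the LLM, so the whole system is still trained and used as text generation.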


Reflections on detectors as parsers

• I started working on object
  detection in 2008 and
  worked on it until ~2022

• The “why?” didn’t matter for
  a long time

• Detection didn’t work at all, it
  was interesting to make
  anything work at all
Was working on detection a mistake?

• Absolutely not
• Science is iterative and noisy
• A single step to the optimum is magic

[Figure: “Science vs not science”, with labels “Science” and “Not science”]

• We gained knowledge along the way

• Our modern techniques build on the
  knowledge we gained

• But to advance, we must iterate
  • Datasets must evolve
  • Tasks must evolve


Takeaway 1 


• Be skeptical of ideas taken for granted for the last 50-60 years

• Example: object detection
  • I want to solve open-ended, real-world QA that powers embodied agents
  • Making better object detectors is 100% the wrong direction for
    achieving this goal, imo
  • It’s too limited, too brittle, too data constrained, etc.


Takeaway 2 


• Answer “what are the real tasks?” for yourself
• We need diversity of perspectives, most will be wrong
• It’s very dull when everyone thinks in the same way
• Your fake tasks are defined relative to your real tasks

• Examples:
  • General VQA and embodied vision
  • Learning with human-like data constraints

• What are yours?


Takeaway 3 


• Trying to solve a scientifically interesting problem with no other
  motivation than gaining knowledge can be extremely fruitful

• Example: object detection
  • Key early success of deep learning in CV, after classification
  • Moved the CV community into deep learning, convinced many people


[Figure: “R-CNN: Regions with CNN features”. 1. Input image; 2. Extract region proposals (~2k); 3. Compute CNN features; 4. Classify regions (aeroplane? no. person? yes. tvmonitor? no.)]

R-CNN paper from CVPR 2014


Thank you! 


Data curation makes this supervised learning 


g.t. labels:    The    brown  fox    jumps  over
                ↑      ↑      ↑      ↑      ↑
                p1     p2     p3     p4     p5
g.t. prefixes:  <S>    The    brown  fox    jumps


• In self-supervised learning the “self”, i.e. the model/data, is all you need
• Instead, careful data curation, BY HUMANS, is king
• We pick sentences (g.t. prefixes and labels) to maximize downstream perf
• This process is nothing more than highly efficient batch labeling
• Ok, this point is mostly a semantic quibble
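The “batch labeling” view can be sketched in a few lines: one curated sentence expands into several supervised (g.t. prefix, g.t. label) pairs. Whitespace tokenization and the function name are simplifying assumptions for illustration; real systems use subword tokenizers.

```python
def next_token_examples(tokens, start_token="<S>"):
    """Expand one sentence into (g.t. prefix, g.t. label) training pairs."""
    examples = []
    prefix = [start_token]
    for label in tokens:
        # The label for the current prefix is simply the next token.
        examples.append((tuple(prefix), label))
        prefix.append(label)
    return examples


examples = next_token_examples(["The", "brown", "fox", "jumps", "over"])
for prefix, label in examples:
    print(" ".join(prefix), "->", label)
```

Choosing which sentences go into the training set therefore fixes every input and every target at once, which is the sense in which curation acts as labeling.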


